Exploratory Data Analysis¶

Owner: Daniel Soukup - Created: 2025.11.01

The goal of this notebook is to explore the dataset, understand our target column and features (statistical properties, data quality), and plan the preprocessing and modelling.

Data Loading¶

Ensure the datasets are available:

Load the datasets:
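The original load cell is not preserved; below is a minimal sketch of how the headerless CSVs could be read. The inline sample string stands in for the real file paths, and the column values are invented:

```python
import io
import pandas as pd

# Stand-in for the real train/test files; the raw data has no header row,
# so we read with header=None and let pandas assign integer column names.
sample = "39, Private, 0, - 50000.\n50, Self-emp, 1, 50000+.\n"
train = pd.read_csv(io.StringIO(sample), header=None, skipinitialspace=True)
print(train.shape)
```

With the real files, `io.StringIO(sample)` would be replaced by the file path.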

We have ~200K rows for training and ~100K rows for testing, each with 42 columns, the last being our target. Note that we are missing the column names.

Observations:

It is clear that we need to do a fair bit of cleaning:

Excerpt from the data dict:

A quick check confirms that there are no exact duplicate columns per se:
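The check itself is not shown; one way to test for exact duplicate columns is to transpose and use `duplicated()`, sketched here on a toy frame where column `c` copies column `a`:

```python
import pandas as pd

# Toy frame standing in for the raw data: two distinct columns plus one exact copy.
df = pd.DataFrame({"a": [1, 2, 3], "b": [0, 0, 1], "c": [1, 2, 3]})

# Transposing makes each column a row, so duplicated() compares whole columns.
dup_mask = df.T.duplicated()
print(dup_mask.sum())  # number of exact duplicate columns
```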

Column Mapping¶

Let's look at the high-level stats first:

We'll use the data dictionary and number of unique values to map our columns to their proper names. We also need this information to make well-informed decisions on how the columns should be processed for modelling.
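The inspection cell is missing; a toy sketch of the unique-value counts used for the matching (the example columns and values are invented):

```python
import pandas as pd

# Anonymous integer-named columns, as in the raw file.
df = pd.DataFrame({
    0: [25, 38, 44, 25],                                      # plausibly "age"
    1: ["Private", "Private", "Self-emp", "Self-emp"],        # plausibly "class of worker"
})

# Unique-value counts help match anonymous columns to data-dictionary entries,
# which list the number of distinct levels per attribute.
nunique = df.nunique().sort_values()
print(nunique)
```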

Based on this we put together our mapping:

About the instance weight:

The instance weight indicates the number of people in the population that each record represents due to stratified sampling. To do real analysis and derive conclusions, this field must be used. This attribute should not be used in the classifiers, so it is set to "ignore" in this file.

We are ready to map the names and look into the mystery column:
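The renaming step could look like the following sketch; the mapping shown is hypothetical, with the real one coming from the data dictionary:

```python
import pandas as pd

# Toy frame with the anonymous integer column names from the raw file.
df = pd.DataFrame([[25, "Private"]], columns=[0, 1])

# Hypothetical excerpt of the mapping assembled from the data dictionary.
column_names = {0: "age", 1: "class of worker"}
df = df.rename(columns=column_names)
print(df.columns.tolist())
```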

Univariate distributions¶

We look at the univariate distribution of each column, which will help us decide on any preprocessing.

Numerical Fields¶

Next, look at the numeric fields:

There are a few fields we should treat as categorical instead, since their numbers encode independent values with no numeric relationship: veterans benefits, own business, and the two recodes.

One might consider adding "num persons worked for employer" and "year" as well, given that their values are concentrated in only a few unique levels.

Let's convert these and process with the rest of the categorical columns:
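The conversion cell is not shown; a minimal sketch of the dtype change on a toy frame (the values are invented):

```python
import pandas as pd

# "veterans benefits" is one of the numeric columns whose codes have no
# ordinal meaning, so we cast it to a categorical dtype.
df = pd.DataFrame({"veterans benefits": [0, 2, 2, 1], "age": [25, 38, 44, 61]})

coded_as_categorical = ["veterans benefits"]  # plus own business, the recodes, ...
for col in coded_as_categorical:
    df[col] = df[col].astype("category")
print(df.dtypes)
```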

Look at high level stats:
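A sketch of the stats call with extended percentiles, which is what surfaces the suspicious cap values; the series here is a toy stand-in for a dollar field:

```python
import pandas as pd

# Toy stand-in for a heavily zero-inflated dollar column.
s = pd.Series([0, 0, 0, 500, 99999], name="capital gains")

# High percentiles expose filler/cap values like 99999.
print(s.describe(percentiles=[0.5, 0.95, 0.99]))
```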

We see some major skew in these fields and some curious values, such as the 9999 and 99999 entries in the dollar fields.

Observations:

It is likely that 9999 was used either as a filler value in these columns or as an artificial maximum to protect the privacy of some outlier data subjects.

These are extreme outlier values for these fields (based on the percentiles above). We also saw that a really high % of values fall exactly at 0:
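The zero-share computation is not shown; a toy sketch of how it could be done:

```python
import pandas as pd

# Toy stand-ins for the zero-inflated dollar fields.
df = pd.DataFrame({"capital gains": [0, 0, 0, 99999],
                   "wage per hour": [0, 500, 0, 0]})

# Share of exact zeros per column.
zero_share = (df == 0).mean()
print(zero_share)
```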

This allows us to see the distribution better.

Proposal

In the future, we can apply log-transformation, scaling to these fields if the models require. We can alternatively consider binning to create categorical fields from these variables too (based on uniform or percentile bins).
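The proposed transforms can be sketched as follows, on a toy zero-inflated series. `log1p` handles the zero mass without producing `-inf`, and `qcut` with `duplicates="drop"` builds percentile bins despite the many tied zeros:

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 0, 500, 99999], name="capital gains")

# log1p(x) = log(1 + x), so zeros map to 0 instead of -inf.
logged = np.log1p(s)

# Percentile bins turn the skewed field into a categorical one; duplicates
# must be dropped because so many values sit exactly at 0.
binned = pd.qcut(s, q=4, duplicates="drop")
print(logged.round(2).tolist())
```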

Categorical fields¶

Let's now consider the categorical fields.

We have a highly imbalanced distribution, with only 6% of train samples in the +50K category. This will likely need addressing: the classification models will be affected and, in general, we expect worse performance on this minority class.

Let's see the rest of the columns:
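The inspection cell is missing; per-column level shares can be obtained with normalized value counts, sketched here on an invented long-tailed column:

```python
import pandas as pd

# Toy long-tailed categorical column: one dominant level plus rare levels.
s = pd.Series(["Private"] * 5 + ["Self-emp", "Federal gov", "Never worked"],
              name="class of worker")

# normalize=True shows each level's share, making the long tail visible.
print(s.value_counts(normalize=True))
```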

We see most columns are again imbalanced with a high number of unique values and long tails.

If all columns are dummy encoded we get > 450 columns (after dropping one level from each column):
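The count can be reproduced with `pd.get_dummies`; a toy sketch with invented columns, where `drop_first` drops one level per column to avoid redundant indicators:

```python
import pandas as pd

df = pd.DataFrame({
    "education": ["HS", "BA", "PhD", "HS"],   # 3 levels -> 2 dummies
    "sex": ["F", "M", "F", "F"],              # 2 levels -> 1 dummy
})

# drop_first=True removes one indicator per column (the "dummy trap").
dummies = pd.get_dummies(df, drop_first=True)
print(dummies.shape[1])
```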

In terms of encoding and processing:

Proposal:

Missing Values¶

We see there is only a single column with missing values, affecting <0.5% of the rows:
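The missingness check is not shown; a toy sketch of computing the per-column missing share and keeping only affected columns:

```python
import pandas as pd

# Toy frame; None stands in for the missing hispanic origin entries.
df = pd.DataFrame({
    "hispanic origin": ["Mexican", None, "All other", "Do not know"],
    "age": [25, 38, 44, 61],
})

# Fraction of missing values per column, filtered to columns with any.
missing = df.isna().mean()
missing = missing[missing > 0]
print(missing)
```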

From the data glossary provided:

Hispanic Origin Persons of Hispanic origin in this file are determined on the basis of a question asking if the person is Spanish, Hispanic, or Latino. If the response is "yes," a follow-up question determines a specific ethnic origin, asking to select their (the person's) origin from a "flash card" listing. The flash-card selections are Mexican, Mexican-American, Chicano, Puerto Rican, Cuban, Cuban American, or some other Spanish, Hispanic, or Latino group.

If these values are missing at random, dropping them would likely make little difference to the modelling.

The missingness does not seem to be skewed with respect to our target or the 'race' column.

Proposal: Since the hispanic origin column already has a "Do not know" value, we can fill the missing entries with that.
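The fill itself is a one-liner; a toy sketch:

```python
import pandas as pd

# Toy stand-in for the hispanic origin column with one missing entry.
s = pd.Series(["Mexican", None, "Do not know"], name="hispanic origin")

# Missing entries join the existing "Do not know" level.
s = s.fillna("Do not know")
print(s.isna().sum())
```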

Duplicate Rows¶

Let's see how many duplicate rows we have:

This is already significant, but if we look at the counts without our mystery column:
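The two duplicate counts can be compared as in this toy sketch, where an invented `mystery` column stands in for the unmapped one:

```python
import pandas as pd

df = pd.DataFrame({
    "age":     [25, 25, 25, 38],
    "sex":     ["F", "F", "F", "M"],
    "mystery": [1.0, 1.0, 2.0, 3.0],  # stand-in for the unmapped column
})

# Duplicates over all columns vs. duplicates ignoring the mystery column.
full_dups = df.duplicated().sum()
partial_dups = df.drop(columns=["mystery"]).duplicated().sum()
print(full_dups, partial_dups)
```

Dropping the mystery column can only increase the duplicate count, which is what we observe on the real data.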

We can see that without our mystery column, we get the approximate number of duplicate instances mentioned in the data dictionary.

Number of instances data = 199523
Duplicate or conflicting instances : 46716
Number of instances in test = 99762
Duplicate or conflicting instances : 20936
Class probabilities for income-projected.test file

It is unclear where these come from, and there is no ID to identify rows (unsurprisingly). It's also a very large number of rows to drop from our data, and some of these duplicates could be legitimate (two people with the same recorded attributes).

Note: the Census does put a fair effort into deduplication. https://www.census.gov/newsroom/blogs/random-samplings/2021/04/how_we_unduplicated.html

There are several reasons for duplicates in a census:

- We receive more than one response for an address.
- People are counted in more than one place because of potentially complex living situations.
- There might be an issue with the address — a housing unit is on our address list more than once or census materials are misdelivered.

We use a special algorithm to resolve the first situation and a series of steps to resolve the second and third.

Relationship to Target¶

We'd like to get a sense of which columns are most strongly related to the target. We can do this with standard statistics and tests (chi2 or f-score). We leave the feature selection to the preprocessing notebook and explore a few promising predictor columns here.

There are only weak relationships among the numeric features.

Between our target and the numeric features:
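One simple way to compare a numeric feature across target classes is a groupby, sketched here on invented values:

```python
import pandas as pd

# Toy frame: one numeric feature against the binary income target.
df = pd.DataFrame({
    "income": ["-50000", "-50000", "50000+", "50000+"],
    "age": [22, 30, 45, 52],
})

# Per-class means give a first read on how the feature shifts with the target.
print(df.groupby("income")["age"].mean())
```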

After a few transformations, we get a better visual:

Observations:

Finally, looking at our categorical fields, we can check pivot tables and chi2 scores:

We can see that these are reasonable columns: industry, class of worker, employment status, etc. are highly likely to affect income. We expect these to show up in our modelling as well.